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Abstract 

Background: The information provided by dense genome-wide markers using high throughput technology is of 
considerable potential in human disease studies and livestock breeding programs. Genome-wide association 
studies relate individual single nucleotide polymorphisms (SNP) from dense SNP panels to individual measurements 
of complex traits, with the underlying assumption being that any association is caused by linkage disequilibrium 
(LD) between SNP and quantitative trait loci (QTL) affecting the trait. Often SNP are in genomic regions of no trait 
variation. Whole genome Bayesian models are an effective way of incorporating this and other important prior 
information into modelling. However a full Bayesian analysis is often not feasible due to the large computational 
time involved. 

Results: This article proposes an expectation-maximization (EM) algorithm called emBayesB which allows only a 
proportion of SNP to be in LD with QTL and incorporates prior information about the distribution of SNP effects. The 
posterior probability of being in LD with at least one QTL is calculated for each SNP along with estimates of the 
hyperparameters for the mixture prior. A simulated example of genomic selection from an international workshop is 
used to demonstrate the features of the EM algorithm. The accuracy of prediction is comparable to a full Bayesian 
analysis but the EM algorithm is considerably faster. The EM algorithm was accurate in locating QTL which explained 
more than 1% of the total genetic variation. A computational algorithm for very large SNP panels is described. 

Conclusions: emBayesB is a fast and accurate EM algorithm for implementing genomic selection and predicting 
complex traits by mapping QTL in genome-wide dense SNP marker data. Its accuracy is similar to Bayesian 
methods but it takes only a fraction of the time. 



Background 

Genome-wide association (GWA) studies are being used 
more often for risk prediction in humans and trait pre- 
diction in livestock. Such studies associate individual sin- 
gle nucleotide polymorphisms (SNP) from a dense 
genome-wide panel with between-individual variation in 
traits. The GWA provides measures of strength of asso- 
ciation and estimates of the size of the effect of each SNP 
even though SNP identified as being of predictive value 
are unlikely to be causative. These GWA studies have 
had limited success as the individual effects of loci are 
often small and relatively few loci pass the very stringent 
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statistical testing criteria imposed. The detected variants 
can be used to construct genetic profiles [1,2] but jointly 
the loci identified often explain less than 10% of the phe- 
notypic variance [2-4]. This small fraction of variance 
explained is due in part to the stringent statistical thresh- 
olds required for identification in GWA studies [5]. 
Nevertheless the scope of the genomic information pro- 
vided by high throughput technology using dense SNP 
panels remains of considerable potential. 

Researchers in other fields, in particular animal and 
plant breeding, have developed methods of prediction of 
genetic value that use all available marker information 
simultaneously and do not apply such stringent tests of 
statistical significance [6,7]. Thus, instead of testing hun- 
dreds of thousands of separate hypotheses of 'is this single 
SNP associated with the trait' as in GWA, the problem is 
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modified to 'what function of the entire SNP information 
provides the best predictor of the trait'. The outcome of 
these approaches is that many more loci are used in pre- 
diction. Although the set will now include false positive 
loci it also includes many more true positive effects and 
the overall predictive power is much improved [8]. This 
approach to genome-wide prediction is called genomic 
selection and is being applied to livestock in practice [9]. 

Different statistical approaches to genomic selection 
have been attempted. One approach is to use the markers 
to construct the realised relationship matrix, rather than 
an expected one based upon pedigree, followed by use of 
this realised relationship matrix in established BLUP pro- 
cedures [8]. When BLUP is used for genomic selection 
(hereafter called GS-BLUP) the prior distribution of the 
marker effects is assumed normal, with the variance of the 
prior distribution being equal for each marker. But this 
"equal variance" assumption is biologically unrealistic as 
many markers will lie in regions that are not involved in 
trait determination and so contribute no trait variance. 
This was the finding in [6] where simulations of genomic 
selection found that GS-BLUP was less accurate than 
Bayesian methods which allowed marker specific variances 
which cause differential shrinkage of marker effects. 

Differential shrinkage of marker effects across the gen- 
ome can be performed by assuming the marker effects 
are normally distributed with variances which are inde- 
pendent random variables following a specific distribu- 
tion. BayesA [6] assumes marker variances follow an 
inverted chi-square distribution while Bayesian LASSO 
(BayesL) [10] assumes an exponential distribution. Inte- 
grating out the variances it can be shown that the con- 
ditional distribution for the marker effects is a double- 
exponential (DE) for BayesL and a ^distribution for 
BayesA. As the DE places more density at zero than a t- 
distribution this suggests that more shrinkage will occur 
for small effects with BayesL than with BayesA. In fact 
the original LASSO [11] can be interpreted as a Baye- 
sian posterior mode when an independent DE prior is 
assigned to each marker effect as shown in equation (2) 
in [10]. However with dense marker data, many SNP 
will not contribute to predicting QTL genotypes 
through LD and the LASSO may not perform enough 
shrinkage of small marker effects to comply with this 
prior knowledge [12]. A somewhat similar conclusion 
was demonstrated in [6] for the ^distribution prior by 
comparing two Bayesian methods called BayesA and 
BayesB. BayesB used a prior mixture which assumed a 
BayesA prior for a small proportion of markers and 
allowed the rest of the marker effects to be precisely 
zero a priori. BayesB was shown to increase selection 
accuracy in simulated data when compared to BayesA. 
However this comparison has not been conducted in a 
full Bayesian analysis using a DE prior like in BayesL. 



A major problem associated with a full Bayesian analy- 
sis is the computing time required to fit the model. The 
challenge is to fit hundreds of thousands of SNP to 
many thousands of individuals with genotypes. Markov 
Chain Monte Carlo (MCMC) techniques such as Gibbs 
sampling are tractable when the dimensionality and data 
size are small. However this is not the case with dense 
SNP data and thus has led to the development of fast 
algorithms for Bayesian-like marker selection models, 
involving either heuristic approximations to fit into 
standard BLUP models [9] or an iterated conditional 
expectation (ICE) approach [13] which iterates an analy- 
tical calculation of each SNP's conditional posterior 
mean. However it is unclear in what sense the solutions 
of these fast algorithms are optimal. 

Expectation maximization (EM) algorithms can use 
the information in a prior distribution through the cal- 
culation of a maximum a posteriori (MAP) estimate [14] 
and are usually much faster than a full Bayesian 
approach. This result was demonstrated in an EM algo- 
rithm developed for implementing genomic selection 
[15]. In this paper we suggest a different formulation of 
the SNP prior mixture compared to the EM algorithm 
called wBST which was developed in [15]. This results 
in a number of advantages which will be discussed later. 
Hence this paper investigates a solution to the Bayesian 
SNP selection model through an EM algorithm which 
has a solid statistical foundation compared with the fast 
heuristic approaches. In the sections that follow (i) we 
develop an algorithm (called emBayesB) using standard 
EM theory, (ii) we propose an implementation to work 
with the dimensionality that is encountered in human 
data sets, (iii) we benchmark emBayesB by analysing a 
simulated workshop data set, and finally (iv) we explore 
the shrinkage features of emBayesB both analytically and 
graphically. 

Methods 

Data model for SNP effects 

Each of the n individuals in the study is genotyped for m 
SNP markers and has a record for a continuous trait y. 
The trait is assumed to depend on alleles of unknown 
QTL which, either directly or indirectly through LD, 
induce an association with the SNP markers. We assume 
that SNP marker / has two alleles, 0 and 1, with 1 being 
the reference allele which has a frequency pj in the n 
individuals. The three possible genotypes '0_0', '0_1' and 
'1_1' for SNP marker / are coded 0, 1 and 2 respectively, 
and are standardised by subtracting the mean (2pj) and 

dividing by the standard deviation (2p,(i -*>,)) " to produce 

the n x 1 vector of standardised frequencies b, which 
satisfies the identities l'b y = 0 and b/b, = n due to the 
standardisation which simplifies the algebra. 
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As each of the n individuals is genotyped for m SNP 
markers we can construct an n x m standardised fre- 
quency matrix B consisting of the m column vectors b y . 
We assume a linear model for the 'SNP mediated' effects 
of the QTL, namely y = Bg + e where y is the n x 1 
vector of phenotypic records, g is the m x 1 vector of 
SNP effects and e is an n x 1 vector of residuals which 
are assumed independent and identically distributed 

normal random variables i.e. e ~ N ( 0, la I j . Hence 
y|g~N(Bg,Icr e 2 ). 

Missing data and SNP prior distribution 

We assume a priori that a proportion y of the SNP mar- 
kers are in LD with at least one QTL and that an 
unknown binary variable Zj (the missing data) indicates 
whether SNP /' is in LD with QTL. That is, a priori 



Y 
■J 



for z j = 1 
for z, = 0 



(1) 



If Zj = 1 (i.e. SNP j is in LD with QTL), the SNP effect 
gj is assumed to be from a DE distribution with para- 
meter A i.e. h ( gj j = -j A exp ^ -A | gj | j where \x\ is the 

absolute value of x. If Zj = 0 (i.e. SNP / is not in LD 
with QTL), the SNP effect gj is assumed to be from a 
Dirac Delta (DD) distribution which has all its probabil- 
ity mass at zero i.e. 8(gj) if gj * 0 such that 

b 

js(g)dg = l where a <0 <b. Hence the conditional 



distribution of gj given Zj is 
Pigj I z } ) = 



j X exp | -A | 

«(«,) 



>/l) 



for Zj 
for z i 



Now the joint prior p(Zj, gj) is as follows 

P(Zj, gj) = P(Zj)p{gj | Zj) 

= (jyX exp(-l\ gj \)) Z ' ((l-y)S(g j )) 1 - Z > 



(3) 



Assuming independence of the m SNP effects, the 
joint prior for z and g is 



m 

P(z- S) = Y\p( z MSi\ z )) 



= ]^[({ r Aexp(-A|g ) |)) Z '((l- r )^ J )) ll " J, 



(4) 



Posterior distribution and EM algorithm 

Apart from a normalising constant, the posterior distri- 
bution p(z, g|y) is 



p(z.g|y) 



p(z.g)/(y|z,g) 
P(y) 



n(i^exp(-l| gj .| 



(5) 



{(i-risig,))^ 1 ' f(Y\B) 



where 



/(yls) = n^;exp 



i 

2ol 
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the data likelihood. Taking logarithms we can show that 
the log posterior is proportional to the following 



(6) 



logp(z,g|y) oc 2^ Zj log(}yA) 

j=i 

m m 

+ £( 1 - Z) .)log(i- r )-A£z i |^.| 

j=l i=l 

m 

+ £(l- Z) .)log(5( gj ))-flog<T e 2 

H 

r 

n m 



To maximize the log posterior, we use Zj as missing 
data in an EM algorithm [14]. In the E-step we evaluate 



. As log p(z, g|y) is linear in Zj, 







( J ^ 






logp 


z lg >J 








V J 





we only need to calculate E 



I g <y 



at each E-step 



(2) where * is the vector of SNP estimates at iteration k. 
Additional file 1 derives an analytical expression for 



f 






= E 


1 g ,y 




V 




J 



which is the posterior probability 



that SNP /' is in LD with at least one QTL given the 
data and the current estimates at iteration k. For the M- 

step, we fix y| and maximize £ z [log p{z, g|y)] with 
respect to the parameters gj, y, A and al ■ 

Estimators of gj, y, A and al for the M-step 

In Additional file 2, it is shown that the maximum a 
posteriori (MAP) estimate of gj is 
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= max[o, y-DE mode ] for G, >0 
= min[o, 7,-DEmode ] for Gj < 0 



(7) 



where 



b/y-X b / b i«i 



and 



y.j ; = y - ^ b ; g ; . As shown in Additional file 2, G ; is 

the maximum likelihood (cML) estimate of gj condi- 
tional on all other SNP estimates. Hence g. is a pro- 
portion of the cML estimate {Gj) after shrinking it 
toward 0 due to the DE prior for gj. If y = 1 then g. 

is the LASSO estimate of g, as the posterior mode for 
a DE prior is the LASSO [10,11]. However only a pro- 
portion of this Bayesian posterior mode is used due to 
the effect of the Dirac Delta prior where the propor- 
tion used is the posterior probability that the SNP is in 

LD with QTL. In fact as shown in Additional file 2 g. 

can be written as a weighted mean of two posterior 
modes; one is the posterior mode when the DE is the 
only prior (DE mode ) and the other is the posterior 
mode when the Dirac Delta is the only prior {DD mode ). 
That is, 

gj = YjDE mode +(l-Yj) DD mode = y*DE mode 

as the DD mode is always zero (i.e. DD mode = 0) reflect- 
ing the posterior certainty due to the Dirac Delta prior 
certainty about the SNP effect. 

It is also shown in equation (B9) of Additional file 2 

that the ML estimators of y, A and al are as follows: 



P = !l'$ k i = and 



Ue = - 



y-Bg 



a ^ 

y-Bg 



(8) 



where y is the vector of posterior probabilities at 
iteration k. 

emBayesB using Gauss-Seidel iteration 

The steps in the EM algorithm using Gauss-Seidel (GS) 
iteration are as follows: 



Step 1. Start with an initial set of values for g.,y, A. 

2 A 

and £ . For example, g . =0, £ _ q qj , (or similar 

A 2 

guessed value), ^ e = ^\-h 2 ^(jy (f° r a guessed 

as the var- 



2_2 

y 



heritability h ), and A = ^2my/h 2 a 
iance of a DE is 2/X 2 and so the total genetic var- 

2 



iance is h 2 a 2 =2my/X' 
Step 2. For SNP (/ 



b/ 



V y-X b / b *s 



l,...,m), calculate 
using 



' A k J^ 1 ^ 
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( A fe A fe " 1 ^ 






g -g 




g -g 
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g g 


V J 
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Gauss-Seidel iteration and use these G y values to cal- 
culate the posterior probabilities y*f for iteration k 
as shown in equation (A4) of Additional file 1. 
Step 3. Use the current estimates of y* to update 
the MAP estimates of gj as shown in equation (7). 
Then update the ML estimates of y, A and a 2 as 
shown in equation (8). 

Step 4. Repeat Steps 2 and 3 until convergence 
which is assessed at iteration k using the criterion 



. Small values 



of the criterion indicate that the estimates are not 
changing much relatively i.e. indicate convergence. 

If needed, fixed effects can be fitted in the model 
simultaneously with the SNP effects as explained in [13]. 

emBayesB for large SNP panels 

In Step 2 of the EM algorithm using GS we calculate all 
possible combinations of b'yb/ (i.e. B'B) each iteration. It 
is more computationally efficient to store the symmetric 
matrix B'B at the start. However this matrix is of order 
m x m which will be huge for large SNP panels. To 
avoid the calculation of B'B we use Gauss-Seidel itera- 
tion with residual update (GSRU) as described in [16] 
where it was used to avoid the calculation of B'B in a 
heuristic BLUP approach to genomic selection. Basically 
GSRU avoids the calculation of B'B by using the identity 

A A k A k 

y.j=y-X b i*i -^ h i£i = ek+l ' + h jh where 

i<j i>j 

e k+1 '' is the vector of estimated residuals at iteration 
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k+1 for the calculation of 



k+1 



Zi 



Hence to implement 



Gauss-Seidel iteration with residual update (GSRU) 
Steps 1 and 4 of the EM algorithm need no modification 
but Steps 2 and 3 need to be changed. The new Steps 2 
and 3 are: 

Step 2 GSRU . For SNP j (j = 1,..., m), calculate 
b 'e k+hi 

1 n I ' J n 



+ g and use this G, value to 



calculate the posterior probability y k+1 for iteration k + 
1 as shown in equation (A4) of Additional file 1. Then 

calculate „ using equation (7) and immediately 

( A k+i A k \ 

update e using e k+1 = e k+1, > - bj gj - gj before 

V > 

the calculation of G j+1 . The update of e results from the 

k k+i 
identity y . = e k+Lj + b . g _ = e k+i,j+i + b . g _ which 

links the two estimates of gj. 
Step 3 GSRU . Now update the ML estimates of y, A and 

a 2 using equation (8). 



loci which gave 1000 markers per chromosome. Marker 
alleles were sampled with equal probability in the foun- 
ders. QTL effects were sampled from a gamma distribu- 
tion. The genomic location and allele substitution 
effects of the 48 simulated biallelic and additive QTL 
are shown in Figure 1. More detail about the QTL 
effects is available in [18]. The number of QTL which 
explain more than 0.1, 1, 5 and 10% of the total genetic 
variation in the training data set were 28, 15, 6 and 4 
respectively. The true breeding value (TBV) of an indivi- 
dual was calculated as the sum of its QTL effects. Phe- 
notypic records were calculated for the training data set 
by adding a normally distributed residual error term to 
each individual's TBV. The variance of the normally dis- 
tributed residual error term was chosen to produce a 
heritability of 0.3 for the trait. 
Statistical analysis 

The prediction equation GEBV = Bg was determined 

for emBayesB using GS iteration by analysing the phe- 
notypes and SNP genotypes of the 4665 individuals in 
the training data set. The number of SNP analysed was 
5726 as only SNP with a minor allele frequency greater 
than 0.05 were used. The initial parameter estimates 



As mentioned in [16], e should be recalculated peri- assumed for emBayesB were g . = 0, * = 0 Q1 , plus £ 



odically (e.g. at each iteration) using p k+i 

numerical errors can 
gested for updating e. 



e-=y-Bg as 

numerical errors can accumulate in the procedure sug- 



Simulation example 

To benchmark the capabilities of emBayesB the SNP 
data distributed to participants of the QTLMAS XII 
workshop was analysed. A summary of the data simula- 
tion is given here, with full details available in [17]. An 
initial population of 50 male and 50 female founder 
individuals was created. For the next 50 generations, 50 
males and 50 females were produced by random sam- 
pling parents each generation. For the last six genera- 
tions, 15 males and 150 females were selected randomly 
for a hierarchical mating, with each male mated ran- 
domly to 10 females who produce 10 progeny each, giv- 
ing a total of 1500 pedigreed progeny per generation. 
The 1200 individuals in the validation data set consisted 
of a random sample of 400 progeny from each of the 
last three generations. The 4665 individuals in the train- 
ing data set were progeny from the preceding four gen- 
erations; three generations of 1500 progeny plus the 
initial 15 males and 150 females. The training data set 
contained both SNP genotypes and phenotypic records, 
while the validation data contained only SNP genotypes. 

There were 6000 biallelic marker loci on six 100 cM 
chromosomes with a 0.1 cM spacing between marker 



and A for an observed phenotypic variance of 4.42 

and heritabilities of 0.1, 0.3, 0.5, 0.7 and 0.9. The algo- 
rithm was stopped when the convergence criteria was 
less than 1 x 10' 8 . The prediction equation was used to 
calculate the GEBV of the 1200 individuals in the valida- 
tion data set using only the genotype of their 5726 SNP. 
The accuracy of the prediction equation was determined 
by correlating GEBV and TBV separately for each of the 
3 generations (400 individuals) of the validation data 
and combined over all 3 generations. The linear regres- 
sion of TBV on GEBV was also calculated as a slope of 
one indicates that the GEBV are unbiased. Spearman's 
rank correlation was calculated for the top 10% of indi- 
viduals ranked on TBV in the validation data. 

GEBV were also calculated for GS-BLUP, LASSO and 
the ICE algorithm. The estimated SNP effects for GS- 
BLUP were solutions to | B'B + a I j g = B'y , where 

a =<Tg JcTg = m{^l-h 2 ^jh 2 . LASSO estimates where 
calculated using emBayesB by fixing y = 1 in each analy- 
sis, and also by fixing A and a 2 at their initial values. 
Details of the ICE algorithm are given in [13]. ICE uses 
fixed values of y, A and a 2 ■ The Fortran 90 source 
code and Windows executable of the emBayesB 
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algorithm (plus GS-BLUP, LASSO and ICE) can be 
found in Additional file 3. 

The emBayesB algorithm had difficulty with estimation 
of A for some heritabilties. This is probably a reflection 
of the flat likelihood surface for estimating A particu- 
larly when combined with estimating y. Hence an 
upper bound was placed on A in each analysis with the 
upper bound being the corresponding A used as the 
initial value for the LASSO. If the bound was reached 
then the current estimate of A was reset to its initial 
value. This procedure seemed to produce a good 
searching algorithm for parameter estimation with 
emBayesB given the complexity of the likelihood 
surface. 

Results 

Comparison of methods using simulated data 

The emBayesB algorithm, and indeed all methods in 
Table 1, took only a few minutes to converge on a 2 
GHz laptop PC for the 6 k SNP panel simulated. This 
was considerably faster than a full Bayesian analysis 
similar to [6] which took approximately 2 days (R. 
Pong-Wong, pers. comm.). A similar difference in com- 
puter time was reported in [13] where ICE was com- 
pared with a full Bayesian analysis (an MCMC 
implementation of the BayesB algorithm). 

emBayesB was the most accurate method of predicting 
TBV in the validation data over all heritabilities (Table 1). 
The emBayesB correlation of 0.88 between GEBV and 
TBV for all 1200 individuals was similar to correlations of 
0.84 to 0.87 for Bayesian MCMC methods performed on 
the same data, but larger than correlations of 0.5 to 0.77 



for various BLUP models [17]. GS-BLUP produced corre- 
lations of 0.75, 0.71 and 0.66 for heritabilities 0.3, 0.5 and 
0.7 respectively (Table 1). Using the top 10% of individuals 
ranked on TBV in the validation data, the calculated Pear- 
son correlation was 0.51 for emBayesB, while the rank cor- 
relation between GEBV and TBV was 0.41 when initial 
heritability was 0.5. This rank correlation was lower than 
the rank correlations of 0.46 to 0.58 for Bayesian MCMC 
methods applied to the same data [17] but larger than the 
GS-BLUP rank correlation of 0.27. 

ICE with y = 0.01 produced a correlation of 0.87 when 
heritability was 0.1 (Table 1). However the correlations 
for ICE decreased as initial heritability increased, 
whereas for emBayesB the correlations remained con- 
stant due to the ability of the EM algorithm to estimate 
the unknown parameters. If the emBayesB parameters y, 

A and <y^ were fixed at their initial values then the cor- 
relations for emBayesB were practically identical to 
those for ICE (Table 1). Predicting TBV separately for 
each generation it was found that the accuracy for both 
ICE and GS-BLUP decreased considerably by the 3 rd 
generation whereas the accuracy for emBayesB 
decreased very little over generations (Table 1). 

The LASSO produced similar correlations to emBayesB 
when heritability was 0.3 and 0.5, but smaller correlations 
when heritability was 0.1, 0.7 and 0.9 (Table 1). Heritabil- 
ities of 0.1, 0.3, 0.5, 0.7 and 0.9 correspond to A values of 
161, 93, 72, 61 and 54 respectively for the LASSO. As A 
decreases the LASSO performs less shrinkage such that 
the number of non-zero LASSO estimates of SNP effects 
increases and was 20, 57, 132, 233 and 1029 as 



Table 1 Correlation and regression coefficient of TBV on GEBV for various generations of the validation data. 



ICE emBayesB 



Gen" 


h 2fc 


GS-BLUP 


7 = 0.07 


7=7 


LASSO y = 1, ( I, ol ) c 






All e 


0.1 


0.75 f (1.19) 9 


0.87 (0.89) 


0.79 (1.22) 


077 (1.49) 


0.87 (0.89) 


0.88 (1.13) 


All 


0.3 


0.75 (0.85) 


0.85 (0.86) 


0.79 (0.92) 


0.87 (1.15) 


0.85 (0.86) 


0.88 (1.05) 


All 


0.5 


0.71 (0.69) 


0.81 (0.79) 


0.76 (0.78) 


0.87 (1.00) 


0.80 (0.78) 


0.88 (0.99) 


All 


0.7 


0.66 (0.55) 


0.74 (0.67) 


0.72 (0.65) 


0.77 (0.75) 


0.74 (0.67) 


0.87 (0.91) 


All 


0.9 


0.57 (0.38) 


0.58 (0.43) 


0.55 (0.35) 


0.57 (0.38) 


0.58 (0.43) 


0.87 (0.90) 


1 


0.5 


0.74 (0.71) 


0.84 (0.82) 


0.78 (0.78) 


0.87 (1.00) 


0.84 (0.82) 


0.88 (0.99) 


2 


0.5 


0.73 (0.71) 


0.81 (0.81) 


0.78 (0.81) 


0.87 (1.03) 


0.80 (0.80) 


0.88 (1.02) 


3 


0.5 


0.68 (0.62) 


0.77 (0.71) 


0.74 (0.72) 


0.85 (0.92) 


0.77 (0.70) 


0.86 (0.92) 



Correlation between TBV and GEBV, and the regression coefficient (in brackets) of TBV on GEBV for each generation, and for all three generations combined, of 
the validation data for GS-BLUP, ICE and emBayesB. Unless indicated otherwise, the initial parameter estimates are y = 0.01, v 1 , =[i-k 1 )<j 2 y and i = (2mr/h 2 o 2 ) x 
where a* = 4.42 is the total phenotypic variance in the training data. The true heritability was 0.3 
" Generations of the validation data used. 
b Initial heritability assumed. 

c Parameters in brackets are fixed at the initial values. 

d Parameters in brackets are all estimated. 

e All three generations combined. 

' Correlation between TBV and GEBV. 

9 Regression coefficient (in brackets) of TBV on GEBV. 
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heritability increased in Table 1. In practice the A value 
would usually be determined by cross validation for the 
LASSO. When heritability was 0.3 or 0.5, the LASSO cor- 
relations decreased very little across the three generations 
similar to emBayesB (Table 1). Using y = 1 the ICE algo- 
rithm was not able to match the performance of the 
LASSO which used a fixed y = 1 in the emBayesB algo- 
rithm with all other parameters fixed (Table 1). The rea- 
son for this result is illustrated graphically later. 

The regression of TBV on GEBV was biased for GS- 
BLUP and ICE for all heritabilities in Table 1. For 
emBayesB and the LASSO the regression of TBV on 
GEBV was only unbiased when heritability was 0.5 
although emBayesB displayed the least bias for each 
heritability. 

For an initial heritability of 0.5, the final parameter esti- 
mates were y=0 .023> 1 = 31.41 and a] = 3.08 for 
emBayesB. If we assume the number of SNP in LD with 
QTL is 48, then the true parameters are / = 48/5726 = 

0.0084, X = (2my/h 2 cr 2 )' A =((2x48)/(0.3x4.42))' A =8.51 
and a 2 = y\ -h 2 ^a 2 =3.09 . The estimated genetic var- 
iance ' a _ 2m y 1 1 X = 0 264 was an underestimate of the 



true genetic variance a 2 = h 2 a 2 = 1.33 . These estimates 

produced a heritability of 0.08 whereas the true heritability 
was 0.3. This underestimation is not surprising given the 
incomplete LD between SNP and QTL. This helps explain 
why ICE produced its largest correlation between TBV and 
GEBV for a heritability of 0.1 in Table 1. 

SNP results for emBayesB when h 2 = 0.5 

The SNP results that follow were obtained using 
emBayesB with an initial heritability of 0.5. As expected 
most SNP have a small posterior probability of being in 
LD with at least one QTL (Figure 1). In fact 5660 SNP 
have posterior probabilities less than 0.1, while only 27 
SNP have posterior probabilities greater than 0.5. 
emBayesB detected all QTL with allele substitution 
effects greater than 0.18 by calculating posterior prob- 
abilities of 0.98 or more for nearby SNP (Figure 1). On 
chromosome 6 all SNP have posterior probabilities less 
than 0.22 which was in accord with the absence of QTL 
simulated on this chromosome. 

Of the 48 QTL simulated, there were 15 QTL which, 
individually explained more than 1% of the total additive 
genetic variation, and in total, explained over 95% of the 
additive genetic variance. emBayesB detected each of 
these 15 QTL by calculating posterior probabilities of 
0.99 or more for nearby SNP (Figure 2). The distance 




Chromosome location (cM) 

Figure 1 48 QTL effects and 5726 SNP posterior probabilities of being in LD with QTL Average effect of an allelic substitution in the 
training data set (H) plotted against genomic location for each of the 48 QTL Also the SNP posterior probability (+) of being in LD with at least 

one QTL plotted against genomic location for each of the 5726 SNP. The QTL effects are in absolute values. 

v ^ ) 
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from each of these 15 QTL to the nearest high probabil- 
ity SNP averaged 0.7 cM, with the largest distance being 
1.7 cM. Three QTL each explained more than 12% of 
the genetic variation and this large variation resulted in 
multiple nearby SNP having posterior probabilities of 1 
(Figure 2). 

There were 25 SNP with posterior probabilities greater 
than 0.9 and the distance averaged 0.85 cM from each of 
these 25 SNP to the nearest QTL explaining more than 
1% of the total genetic variation. As the genetic variation 
explained by a QTL dropped below 1%, the posterior 
probability of nearby SNP decreased toward zero. Hence 
in this simulation it was found that the SNP posterior 
probabilities could be used to accurately locate QTL 
explaining more than 1% of the total genetic variation. 

In general the SNP used for prediction were different 
for emBayesB and the LASSO. For example with an 
initial heritability of 0.5, the number of estimated SNP 
effects greater than 0, 0.01 and 0.1 was 2841, 15 and 10 
for emBayesB compared with 132, 72 and 6 for the 
LASSO. However the LASSO did use SNP which 
emBayesB estimated as having a non-zero posterior prob- 
ability of being in LD with QTL. For example, the LASSO 
used 57 and 132 non-zero estimates of SNP effects for 
heritabilities of 0.3 and 0.5 respectively, and these SNP 
had average posterior probabilities of 0.31 and 0.16 of 
being in LD with QTL as estimated by emBayesB. 



Analytical emBayesB shrinkage 

In this section we graphically explore features of 
emBayesB in order to assist with understanding how 
the algorithm works. Figure 3 shows the shape of the 
conditional posterior distribution of gj given in equation 
(A2) of Additional file 1. The graphs assume 

y = 0.05, A. = 10, <jg = 1 and n = 500 plus we have used 

a DE with A s = 1000 (i.e. a Spike at 0) to replace the 
Dirac Delta function as done in Additional file 2. The 
mixture prior in Figure 3 is given in equation (Al) of 
Additional file 1. We call the function h(gj\Gj, a 2 ) in Fig- 
ure 3 a Likelihood as it is the normally distributed con- 
ditional likelihood derived in Appendix 2 of [13]. 

When the cML estimate (G.) of gj is far from 0 the 
conditional posterior resembles the conditional likeli- 
hood, but is slightly shifted (or shrunk) toward 0. The 
mode of the shrunk likelihood is Gj-Xa 2 (=Gj-0.02) when 
Gj is much greater than 0. This is the LASSO estimate 
as the Spike has little influence when G y is far from 0. 
However as G y approaches 0, the conditional posterior 
becomes bimodal, with the height of the mode at 0 
increasing the closer G ; is to 0 (Figure 3). This reflects 
the fact that, the closer G ; is to 0, the higher is the 
probability that the true gj is 0. Using numerical integra- 
tion in equation (A3) it can be shown that the area 
under the DE part of the conditional posterior is 0.99, 
0.67 and 0.18 for Gj values of 0.19, 0.15 and 0.11 as 
shown in Figure 3. In the EM algorithm this DE area is 
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Figure 3 Graphical illustration of how a posterior probability is calculated for a SNP. Graphs of the mixture prior p(gj), conditional 
ikelihood h(gj\Gj, cr) and conditional posterior distribution p(g y |g_^y) as given in equations (A1) and (A2) of Additional file 1 for 
y = 0.05, X = 10, <7g = 1 ar| d n = 500. The Dirac Delta function is replaced by a DE with A 5 = 1000 i.e. a Spike at 0. The posterior 
probability y, of SNP j being in LD with QTL is calculated from equation (A3) by numerical integration. Figures A and B show the distributions 
when Gj is 0.19 and 0.1 1 respectively, where G, is the conditional maximum likelihood estimate of g t 



, the posterior probability that SNP / is in LD with at 
least one QTL given the data and all other current esti- 
mates at iteration k. 

Using numerical integration it can also be shown 
that the mean of the conditional posterior is 0.1677, 
0.0868 and 0.0165 for G, values of 0.19, 0.15 and 0.11 
respectively, while the MAP estimates of gi (calculated 
using equation (7)) are 0.1677, 0.0868 and 0.0163 for 
the same values of Gj. So the MAP estimate of gj is an 
accurate estimate of the conditional posterior mean. 
Hence at convergence, it is reasonable to expect that 
the MAP estimate will be an accurate estimate of the 
marginal posterior mean of gj. Bayesian MCMC meth- 
ods use the marginal posterior mean of each SNP in 

the prediction equation GEBV = Bg, whereas 

emBayesB uses the MAP estimate given in equation 
(7). Hence it is not surprising to find that emBayesB 
has a similar accuracy of prediction compared to Baye- 
sian MCMC methods as found in the analysis of the 
simulated workshop data. 

The analytical relationship between the conditional 
posterior mean E(gj | g^-,y) and the MAP estimate of gj 
is explored further in Figure 4. The analytical deriva- 
tion of E{gj | g. ; ,y) is given in Appendix 1 of [13], 
while the MAP estimate of gj is calculated using equa- 
tion (7) with Yj given by equation (A4). Plots of the E 
(gj I 8-/>y) versus Gj are given in Figures 4A and 4C, 

while plots of the MAP estimate g . versus G y are 

given in Figures 4B and 4D. The most striking feature 
is the similarity of the paired curves when comparing 
Figures 4A and 4B (A = 1.0 and the same y), or when 
comparing Figures 4C and 4D (the same A and 



y = 0.1). Once again it seems that the MAP estimate 
of gj is an accurate estimate of the conditional poster- 
ior mean as found earlier. The difference between the 
paired curves is largest when y = 1 and G y is close to 0 
as can be seen in Figures 4A and 4B. 

When y = 1 in Figure 4B, the MAP curve resembles a 
broken stick which is absolutely flat around the origin. 
This is the LASSO estimate which is a broken stick for 



all values of A. The LASSO's 



estimate is shrunk the 



constant amount of Act 2 (= 1 in Figure 4B) from the 
cML estimate of Gj as shown in equation (7) when G, is 
past the break in the stick. As the value of y decreases 
in Figure 4B, the asymptotic value of Gj±Xa 2 {=DE mode ) 
is shrunk even more, and in a non-linear manner, as Gj 
approaches the origin, with greater shrinkage for smaller 
y values. This is due to the a priori belief that a propor- 
tion (1 -y) of the SNP are 0 and so small values of G y 
are more probably 0, and so shrunk more, as y 
decreases. In fact the shrinkage is proportional to Yj as 
shown in equation (7). 

When y = 0.1, the MAP curves show strong shrinkage 
to 0 for any G, values between -2a and 2a for all values of 
A (Figure 4D). Even more shrinkage of small G ; values 
occurs when y = 0.01 as the Gj interval increases to (-3a, 
3a) as shown in Figure 4B. As A increases in Figure 4D, 
the variance of the DE distribution gets smaller which 
results in a smaller total genetic variance for fixed m and 
y. Hence we need more shrinkage i+Xa 2 ) of large |G ; | 
values as shown by the different asymptotes in Figure 4D. 

Discussion 

This study has developed a fast EM algorithm for gen- 
ome wide prediction in which there is a joint prediction 
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of breeding value from accumulated SNP data. The ben- 
efits of the algorithm are its fast performance, its verity 
in relation to the proposed model, and the optimality 
properties it brings from application of the EM algo- 
rithm. The time advantage of emBayesB over a full 
Bayesian analysis is expected to be even greater with 
dense 500 k SNP panels currently being used in GWA 
studies. A disadvantage of emBayesB is that no standard 
errors are routinely available from an EM algorithm. 
However there are methods of obtaining standard errors 
with an EM algorithm [14] and even bootstrapping is a 
possibility given the fast performance. 

The predictive power of emBayesB comes from the 
use of the information-rich prior mixture distribution 
which is of particular value when the number of QTL is 
small relative to the number of independent genomic 
segments [19]. In fact it is expected that there will be 
no advantage in using emBayesB over GS-BLUP if the 
simulated QTL more closely fit an infinitesimal model. 
As with other recent studies [10,11] emBayesB uses a 
DE prior distribution for QTL effects which has some 
experimental justification [8]. In addition emBayesB 
incorporates a priori that an unknown proportion of 
SNP will not be in LD with QTL through the use of the 
Dirac Delta function in the prior mixture distribution 
for the SNP effects. This SNP prior mixture is quite dif- 
ferent to that used in the EM algorithm wBSR in [15] 



where the Dirac Delta function was not used to model 
the absence of LD. The wBSR algorithm derived in [15] 
is only an approximate EM algorithm due to the 
approximation used to include the missing data variable 
Yi (the SNP weight) into the EM modelling process. 
Using a Dirac Delta function in the prior mixture seems 
a more theoretically attractive way of modelling the LD 
between SNP and QTL and produces some appealing 
analytical results like the posterior probability formula 
in equation (A4) and the result that the best estimate of 
a SNP effect can be viewed as a regressed DE mode as 
shown in equation (7). 

emBayesB is an EM algorithm which has similarities 
with the fast heuristic algorithm called ICE [13]. ICE uses 
the same formulation of the data model and the SNP 
prior distribution but iterates on the mean of each SNP 
effect conditional on all the other SNP mean effects, the 

y data and assuming fixed values for y, X and ■ It is 

unknown in general how optimal ICE solutions are. But 

if the fixed values of y, A and <j 2 assumed in ICE are set 

equal to the ML estimates obtained from emBayesB then 
we have found that the prediction accuracy of ICE is 
identical to the prediction accuracy of emBayesB (e.g. see 
h 2 = 0.1 in Table 1). This seems to reinforce the conclu- 
sion drawn from Figure 4 that the posterior mean of a 
SNP effect is well approximated by the MAP estimate in 
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equation (7). Hence it is no surprise to find that the accu- 
racy of prediction calculated in the simulated example of 
[13] was similar for ICE and a Bayesian MCMC imple- 
mentation of the BayesB model as ICE assumed fixed 

values of y, A and <j 2 which were close to optimal. 

The simulated example of [13] used an 8010 SNP 
panel with 1000 individuals in the training and valida- 
tion data sets. The speed advantage of ICE was large; 
ICE converged in 2 to 5 minutes compared to 47 hours 
for the Bayesian MCMC analysis. The computational 
speed advantage of ICE comes from the analytical calcu- 
lation of the conditional posterior mean; emBayesB uses 
a similar analytical calculation for the conditional pos- 
terior probability. As ICE and emBayesB took similar 
amounts of CPU time in Table 1, the results for ICE in 
[13] provide further evidence of the computational 
speed advantage of emBayesB over a full Bayesian 
analysis. 

emBayesB is similar to the empirical Bayes method 
suggested by [20] where Bayesian hyperparameters are 
estimated by marginal and conditional maximum likeli- 
hood methods. Taking an empirical Bayes approach in a 
wavelet regression application, [21] used marginal maxi- 
mum likelihood with various prior mixtures involving 
the Dirac Delta function (including the DE as in 
emBayesB) to evaluate shrinkage of wavelet noise. They 
compared the posterior mean and posterior median as 
shrinkage methods and showed that the posterior med- 
ian, unlike the posterior mean, produces a threshold 
rule for estimation in that estimated wavelet coefficients 
below a calculated threshold were set to exactly zero. 
The emBayesB estimate of a SNP effect is also calcu- 
lated using a thresholding rule (see equation (7) and 
Figure 4). As with emBayesB, the empirical Bayes meth- 
ods of [21] combine fast computation with good theore- 
tical properties. 

The simulated example used in this paper did not 
show any advantage for emBayesB over the LASSO. 
However in a simulated example of wavelet denoising, 
[22] demonstrated an advantage over the LASSO of 
both a Bayesian sigmoid model and the empirical Bayes 
method of [21] which uses a DD+DE mixture prior like 
in emBayesB. In fact various methods for shrinking coef- 
ficients in regression models were compared in [22] 
including the Bayesian sigmoid model which has a single 
hyperparameter to tune the shrinkage. The bimodal nat- 
ure of the marginal posterior for a regression coefficient 
in the Bayesian sigmoid model (Figure 2 in [22]) is very 
similar to the bimodal nature of the conditional poster- 
ior distribution of a SNP effect as shown in Figure 3. 
The shape of the shrinkage graph for the Bayesian sig- 
moid model (Figure 4 in [22]) is also similar to the 



emBayesB shrinkage graph when y is small and A is 
small (see Figure 4D with y = 0.1, A = 0.1). However 
emBayesB will estimate values for y and A, and so, 
unlike the Bayesian sigmoid model, emBayesB can adapt 
its shrinkage such that it is appropriate for the prevail- 
ing nature of the data like in Figure 4. 

Conclusions 

This paper reports an EM algorithm called emBayesB for 
genome wide prediction in which there is a joint predic- 
tion of breeding value from dense SNP marker data. A 
formulation of the emBayesB algorithm using GSRU is 
developed to handle large SNP panels. Using a simulated 
and widely available dataset it was found that the accu- 
racy of emBayesB was similar to Bayesian approaches, 
but emBayesB took only a fraction of the computational 
time. Using emBayesB may be a promising solution to 
the problem found in GWA studies with the use of strin- 
gent statistical thresholds. The emBayesB calculation of 
posterior probabilities of SNP being in LD with QTL may 
also be useful in the area of SNP subset selection. Due to 
the fast computational speed, opportunities exist with 
emBayesB to explore fitting innovative models which 
could include non-additive genetic variation or even 
simultaneous fitting of multiple traits. More research is 
needed to explore the opportunities which emBayesB 
offers and to benchmark its capabilities. 

Availability and requirements 

The simulated data analysed in the paper is available on 
the 12 th QTLMAS workshop website http://www.com- 
putationalgenetics.se/QTLMAS08/QTLMAS/DATA. 
html. The program emBayesB is available both as For- 
tran 90 source code and as a Windows executable in 
Additional file 3. 
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